import geopandas as gpd
import seaborn as sns
from libpysal import graph
from sklearn import cluster, preprocessing

Find me a cluster
One solution. You may have yours.
chicago = gpd.read_file(
"https://martinfleischmann.net/sds/chapter_07/data/chicago_influenza_1918.geojson"
)

Before working with clustering, do you remember the note about data standardisation? The demographic variables in the table are not on the same scale, so you need to standardise them before using K-Means.
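As a reminder of what robust scaling does, here is a minimal sketch on toy data (not the Chicago table): `robust_scale` subtracts each column's median and divides by its interquartile range, so columns of very different magnitudes become comparable.

```python
import numpy as np
from sklearn import preprocessing

# Two toy columns on wildly different scales; the second is 100x the first.
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0],
                [4.0, 400.0]])

scaled = preprocessing.robust_scale(toy)

# After scaling, every column has median 0 and both columns are identical,
# since only their scale differed.
print(np.median(scaled, axis=0))  # [0. 0.]
```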
demographics = [
"gross_acres",
"illit",
"unemployed_pct",
"ho_pct",
"agecat1",
"agecat2",
"agecat3",
"agecat4",
"agecat5",
"agecat6",
"agecat7",
]
chicago[demographics] = preprocessing.robust_scale(chicago[demographics])
chicago.head(2)

| | geography_code | gross_acres | illit | unemployed_pct | ho_pct | agecat1 | agecat2 | agecat3 | agecat4 | agecat5 | agecat6 | agecat7 | influenza | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | G17003100388 | -0.194708 | 0.465558 | 0.121762 | -0.577117 | -0.145658 | -0.430586 | -0.320059 | -0.367526 | -0.419623 | -0.304649 | -0.397324 | 29 | POLYGON ((358405.051 570342.347, 358371.811 57... |
| 1 | G17003100197 | -0.201697 | 2.413302 | -0.701498 | -0.629427 | 0.627451 | 0.434924 | 0.408555 | 0.358749 | -0.009057 | -0.143917 | -0.205352 | 30 | POLYGON ((356903.353 580393.561, 356895.319 58... |
If you check the values now, you will see that they are all distributed around 0.
_ = sns.pairplot(chicago[demographics])
Pick a number of clusters
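There is no single correct number of clusters; one common heuristic is the elbow method. A sketch on synthetic data (in the notebook you would pass `chicago[demographics]` instead of the synthetic matrix): fit K-Means for a range of k and inspect the inertia (within-cluster sum of squares), looking for the point where the drop flattens.

```python
import numpy as np
from sklearn import cluster

# Synthetic data with three well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 3, 6)])

inertias = {}
for k in range(1, 7):
    km = cluster.KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias[k] = km.inertia_

# Inertia always decreases with k; the "elbow" where the decrease
# flattens suggests a reasonable number of clusters.
print(inertias)
```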
n_clusters = 4

Run K-Means for that number of clusters
kmeans = cluster.KMeans(n_clusters=n_clusters, random_state=0, n_init=1000)
kmeans.fit(chicago[demographics])

KMeans(n_clusters=4, n_init=1000, random_state=0)
Plot the different clusters on a map
chicago["cluster"] = kmeans.labels_
chicago.explore("cluster", categorical=True)Analyse the results: - What do you find? - What are the main characteristics of each cluster? - How are clusters distributed geographically? - Can you identify some groups concentrated on particular areas?
groups = chicago.groupby("cluster")
groups.size()

cluster
0 356
1 36
2 99
3 5
dtype: int64
groups[demographics].mean().T

| cluster | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| gross_acres | 0.191129 | 9.231985 | 1.214300 | 28.355467 |
| illit | 0.136419 | 0.374769 | 2.098371 | 0.290736 |
| unemployed_pct | -0.185173 | 0.217657 | -0.190443 | 0.278034 |
| ho_pct | -0.004221 | 1.288033 | 0.097141 | 1.885749 |
| agecat1 | -0.130173 | 0.179272 | 1.498118 | 0.956863 |
| agecat2 | -0.146134 | 0.290070 | 1.586965 | 1.138612 |
| agecat3 | -0.157950 | 0.306211 | 1.508939 | 1.128909 |
| agecat4 | -0.168607 | 0.164381 | 1.276251 | 0.918705 |
| agecat5 | -0.153945 | -0.101216 | 1.146712 | 0.677434 |
| agecat6 | -0.072975 | -0.119848 | 0.921775 | 0.542829 |
| agecat7 | 0.006919 | -0.171417 | 0.789859 | 0.589063 |
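Beyond eyeballing group sizes and mean profiles, a clustering can be sanity-checked numerically, for instance with the silhouette score (range -1 to 1, higher is better). A hedged sketch on synthetic data; in the notebook you would pass `chicago[demographics]` and `kmeans.labels_` instead.

```python
import numpy as np
from sklearn import cluster, metrics

# Two clearly separated synthetic blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 4)])
labels = cluster.KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

# Well-separated clusters score close to 1; overlapping ones close to 0.
print(metrics.silhouette_score(X, labels))
```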
Create a spatially lagged K-Means
queen = graph.Graph.build_contiguity(chicago).transform("r")
for column in demographics:
chicago[column + "_lag"] = queen.lag(chicago[column])
demographics_spatial = demographics + [column + "_lag" for column in demographics]
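To see what the lag columns contain, here is the same computation sketched with plain NumPy on a toy three-unit chain (not the Chicago graph): with a row-standardised adjacency matrix W, the spatial lag of x is W @ x, i.e. the mean of each observation's neighbours, which is what `queen.lag` returns for the row-standardised Queen contiguity graph.

```python
import numpy as np

# Toy chain of three units: 0-1-2 (unit 1 neighbours both ends).
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
W = W / W.sum(axis=1, keepdims=True)  # row-standardise

x = np.array([10.0, 20.0, 40.0])
lag = W @ x
print(lag)  # neighbour means: [20. 25. 20.]
```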
kmeans_lag = cluster.KMeans(n_clusters=n_clusters, random_state=42, n_init=1000)
kmeans_lag.fit(chicago[demographics_spatial])
chicago["cluster_lag"] = kmeans_lag.labels_
chicago.explore("cluster_lag", categorical=True)Develop a regionalisation using agglomerative clustering
agg = cluster.AgglomerativeClustering(n_clusters=n_clusters, connectivity=queen.sparse)
agg.fit(chicago[demographics])

AgglomerativeClustering(connectivity=<496x496 sparse array of type '<class 'numpy.float64'>'
    with 2274 stored elements in Compressed Sparse Row format>,
                        n_clusters=4)

chicago["cluster_agg"] = agg.labels_
chicago.explore("cluster_agg", categorical=True)Generate a geography that contains only the boundaries of each region and visualise it.
regions = chicago[["cluster_agg", "geometry"]].dissolve("cluster_agg")
regions.reset_index().explore("cluster_agg", categorical=True)
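Dissolving returns filled polygons, one per region. If you want only the outlines, the dissolved geometries can be converted to their boundary lines with `GeoSeries.boundary`. A minimal sketch on two toy squares, since the Chicago data lives at a remote URL; the same call applies to `regions`.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Two adjacent toy "regions".
regions_toy = gpd.GeoDataFrame(
    {"cluster_agg": [0, 1]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
)

# .boundary turns each polygon into the line tracing its edge.
outlines = regions_toy.boundary
print(outlines.geom_type.tolist())  # ['LineString', 'LineString']
```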